3  Chapter 3: Missing data imputation in mIF imaging

As with all data created and collected by humans, missing data are inevitable in mIF images. Bao et al. (2021) give a brief summary of the types of missing data in mIF images, shown in Figure 3.1. Case 1 in Figure 3.1 refers to one or more entire marker channels being missing. This type of missing data is rare and is often due to low image quality. Other possible reasons for a missing channel, not described in Bao et al. (2021), include a supply shortage of a certain fluorescent material or a change in research plan. Despite the rarity of case 1, there is demand for marker channel imputation in low-plex images. Due to time and financial constraints, mIF images with no more than seven channels are often more feasible to obtain than 40-channel mIF images (Wu et al. 2023). To overcome the limits of deriving cell phenotypes from a small number of markers, imputation of marker channels has been proposed. Case 2 in Figure 3.1 occurs more frequently: tissue wears off during the staining and wash-off cycles described in Figure 1.2.

Owing to the rapid development of computer vision, all current applications of mIF imputation are implemented with machine learning and/or deep learning methods. Of the three applications covered in this chapter, Bao et al. (2021) use generative adversarial networks (GANs), Wu et al. (2023) use gradient-boosted decision trees in combination with a convolutional neural network, and Sims and Chang (2023) use masked autoencoders (MAE). All three methods are reported to perform well, as expected given the maturity of machine learning methods. However, the subsequent analysis can benefit from statistical thinking in data imputation. This is discussed further in Chapter 4.

Figure 3.1: Types of missing data in mIF. Image courtesy of Bao et al. (2021).

3.1 Application case 1: Missing tissue imputation

3.1.1 Method: GANs

The fundamental version of GANs comprises two components: a discriminator and a generator (Goodfellow et al. 2014). Figure 3.2 by Bok and Langr (2019) gives a brief sketch of how GANs work. As in a turn-based strategy game, the two components take turns updating. Starting from samples of a noise distribution (for example a uniform distribution), the generator's goal is to generate data that are close to the real data. The discriminator's goal is to identify the real data within a mix of real data and data generated by the generator. With the classification error fed back to both the generator and the discriminator, the two opponents update their weights: the generator tries to maximize the probability that the discriminator misclassifies generated data as real, and the discriminator tries to maximize its classification accuracy. Given enough rounds, they ideally approach an equilibrium where neither party can improve more than negligibly: the generator produces close-to-real data, and the discriminator classifies with 50% accuracy (Bok and Langr 2019). This is the point at which training ideally stops.
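For reference, the game described above is commonly written as the minimax objective of Goodfellow et al. (2014). In the notation assumed here, \(D\) is the discriminator, \(G\) the generator, \(p_{\text{data}}\) the distribution of the real data and \(p_{\pmb z}\) the noise distribution:

\[
\min_G \max_D V(D, G) = \mathbb{E}_{\pmb x \sim p_{\text{data}}}\left[\log D(\pmb x)\right] + \mathbb{E}_{\pmb z \sim p_{\pmb z}}\left[\log\left(1 - D(G(\pmb z))\right)\right]
\]

At the theoretical equilibrium the generated distribution matches \(p_{\text{data}}\) and \(D(\pmb x) = 1/2\) for every input, which corresponds to the 50% classification accuracy mentioned above.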

Figure 3.2: How GANs work. Image courtesy of Bok and Langr (2019)

One disadvantage of the original GANs is their weak control over the generated data, due to the random noise input. This disadvantage stands out especially in image synthesis. Conditional GANs (CGANs) provide a promising solution to this issue by feeding a condition \(\pmb y\) to both the generator and the discriminator (Mirza and Osindero 2014). \(\pmb y\) can be any auxiliary information about the target, such as a class label or, in the case of image synthesis, another image. Building on this, pix2pix performs image-to-image translation by training the model on image pairs, where one image in the pair serves as the input and the other serves as the output (Isola et al. 2017; Souza et al. 2023).
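As a point of reference, the conditional objective of Mirza and Osindero (2014) can be written as below, where \(D\) is the discriminator, \(G\) the generator, \(\pmb x\) a real sample, \(\pmb z\) the input noise and \(\pmb y\) the condition. The notation is an assumption chosen for this chapter:

\[
\min_G \max_D V(D, G) = \mathbb{E}_{\pmb x \sim p_{\text{data}}}\left[\log D(\pmb x \mid \pmb y)\right] + \mathbb{E}_{\pmb z \sim p_{\pmb z}}\left[\log\left(1 - D(G(\pmb z \mid \pmb y) \mid \pmb y)\right)\right]
\]

In an image-to-image setting such as pix2pix, \(\pmb y\) is the input image of a pair and \(\pmb x\) is the corresponding output image.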

3.1.2 Application in mIF: pixN2N-HD

pixN2N-HD is a “novel multi-channel high-resolution image synthesis approach” (Bao et al. 2021). “N2N” stands for “N-to-N”, which distinguishes it from the widely used (N-1)-to-1 design. N is the number of marker channels, and in the dataset used in this paper, N=11. In the (N-1)-to-1 design, 10 channels are used as input and 1 channel is used as output, so 11 separate models must be trained, one for each held-out channel. The “N-to-N” design instead uses a random gate strategy, as shown in Figure 3.3. This strategy randomly selects up to N-1=10 markers as the “missing” data. These channels are replaced with blank images before being input to the generator, which generates images for all channels, but only the imputed images for the missing channels are sent to the discriminator. The discriminator then attempts to distinguish the real images from the fake ones, as described above. The image input to the generator also serves as the condition for the discriminator, similar to pix2pix.
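The random gate step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `random_gate`, the `(C, H, W)` array layout, the use of zeros as the blank image and the choice to hide at least one channel are all assumptions made for illustration.

```python
import numpy as np

def random_gate(image, rng, max_missing=None):
    """Blank out a random subset of channels in a (C, H, W) image.

    Hypothetical helper: returns the gated image and a boolean mask
    marking which channels were treated as "missing".
    """
    n_channels = image.shape[0]
    if max_missing is None:
        max_missing = n_channels - 1  # always keep at least one real channel
    # Assumption: hide between 1 and max_missing channels (inclusive).
    n_missing = rng.integers(1, max_missing + 1)
    missing_idx = rng.choice(n_channels, size=n_missing, replace=False)
    missing = np.zeros(n_channels, dtype=bool)
    missing[missing_idx] = True
    gated = image.copy()
    gated[missing] = 0.0  # a blank image stands in for each missing channel
    return gated, missing

rng = np.random.default_rng(0)
image = rng.random((11, 64, 64))  # toy data with N = 11 marker channels
gated, missing = random_gate(image, rng)
```

In a training loop, `gated` would be the generator input, and `missing` would determine which generated channels are passed on to the discriminator.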

The paper evaluates model performance by comparing the “N-to-N” model with the “(N-1)-to-1” model and with an “(N-1)-to-1 random gate” model, which incorporates the random gate but still requires training 11 separate models. An index of image similarity, the structural similarity index measure (SSIM), is used to assess whether the “N-to-N” model generates results comparable to those of the other two methods (Wang et al. 2004). The results show that no pair of methods differs significantly at the 0.05 significance level, and the methods are therefore considered comparable. The “N-to-N” model takes substantially less time to train than the other methods, which is a meaningful gain in computational efficiency.
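To make the similarity index concrete, the sketch below computes SSIM from whole-image statistics using the formula of Wang et al. (2004) with its usual constants \(C_1 = (0.01L)^2\) and \(C_2 = (0.03L)^2\), where \(L\) is the data range. This is a simplification: the standard index averages the same quantity over local sliding windows. The function name `global_ssim` and the toy images are assumptions for illustration.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """SSIM computed from whole-image statistics (a single global window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

rng = np.random.default_rng(0)
real = rng.random((64, 64))  # toy "real" channel with values in [0, 1)
noisy = np.clip(real + 0.1 * rng.standard_normal(real.shape), 0.0, 1.0)
score_same = global_ssim(real, real)    # identical images give the maximum
score_noisy = global_ssim(real, noisy)  # degraded image gives a lower score
```

An image compared with itself scores 1, and the score drops as the synthesized image departs from the real one.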

Figure 3.3: Work flow of pixN2N-HD. Image courtesy of Bao et al. (2021).

3.2 Application case 2: Marker channel imputation

Both 7-UP and CyCIF panel reduction are intended for marker channel imputation, giving studies that can only obtain low-plex images access to otherwise expensive high-plex (40+ channels) mIF images. Interestingly, the two applications use very different imputation methods.

3.2.1 Application 2.1: 7-UP

7-UP starts from a 7-plex mIF image and generates a high-plex image from which up to 16 different cell types can be identified (Wu et al. 2023). This approach consists of three main parts:

  1. Marker panel selection. This part selects the seven markers to start with, using a concrete autoencoder. The concrete autoencoder is a feature selection method whose loss function is the difference between the original sample and the sample reconstructed from the selected low-dimensional subset of features (Balın, Abid, and Zou 2019).
  2. Morphology feature extraction. This step uses a convolutional neural network to learn morphology features, i.e. spatial and structural features of cells. Convolutional neural networks can loosely be thought of as stacked layers of linear regressions, in which many combinations of weights are linked to each input variable.
  3. Marker expression imputation. Once the location and structure of cells are learned, the remaining task is to impute the expression of each marker in each cell. The imputation is performed using XGBoost, a scalable gradient-boosted tree library (Chen and Guestrin 2016).
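The selection mechanism of a concrete autoencoder in step 1 can be sketched as follows. Each of the seven panel slots draws a relaxed one-hot weight vector over all markers using Gumbel noise and a temperature-scaled softmax, so that at low temperature each slot effectively picks a single marker. This is a minimal sketch with untrained, random selection parameters; the decoder and the reconstruction loss are omitted, and the function name `concrete_selector` and the toy dimensions are assumptions.

```python
import numpy as np

def concrete_selector(x, logits, temperature, rng):
    """Relaxed selection of k features from each row of x.

    x:      (n_samples, d) marker intensities
    logits: (k, d) selection parameters, one row per panel slot
    """
    # Gumbel(0, 1) noise makes the selection stochastic during training.
    uniform = rng.uniform(1e-12, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(uniform))
    scores = (logits + gumbel) / temperature
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return x @ weights.T, weights

rng = np.random.default_rng(0)
n_markers, panel_size = 40, 7
x = rng.random((5, n_markers))  # toy expression data
logits = rng.standard_normal((panel_size, n_markers))  # untrained parameters
selected, weights = concrete_selector(x, logits, temperature=0.1, rng=rng)
# After training, each slot keeps its highest-scoring marker.
# Distinct slots can in principle pick the same marker.
panel = logits.argmax(axis=1)
```

During training the temperature is typically lowered gradually, so the weights move from a soft mixture of markers toward a discrete choice.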

A series of evaluations and analyses is performed to show the validity of the method. Performance is examined in three ways:

  1. Calculating the Pearson correlation coefficient between the imputed marker expression and the marker expression in the testing data.
  2. Calculating the F1 score between the cell types derived from the imputed data and those from the testing data. The F1 score is the harmonic mean of precision and sensitivity: \(2/(\text{sensitivity}^{-1}+\text{precision}^{-1})\). Cell types are assigned from the marker expression using k-nearest neighbors.
  3. Patient survival status, HPV status and disease recurrence are used to further evaluate the cell type outcomes. The AUC score for patient status prediction is calculated for both the imputed data outcome and the training data.
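The first two metrics can be illustrated with a short NumPy sketch. The arrays below are made-up toy values, not results from the paper, and the F1 score is computed for a single cell type in a one-vs-rest fashion, which is an assumption about how a multi-class comparison might be broken down.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def f1_score_binary(y_true, y_pred):
    """F1 score as the harmonic mean of precision and sensitivity."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)   # true positives
    fp = np.sum(~y_true & y_pred)  # false positives
    fn = np.sum(y_true & ~y_pred)  # false negatives
    if tp == 0:
        return 0.0  # convention when no positive is correctly predicted
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return float(2 / (1 / sensitivity + 1 / precision))

# Toy per-cell expression of one marker: measured vs. imputed.
measured = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
imputed = np.array([0.2, 0.5, 0.30, 0.7, 1.0])
r = pearson_r(measured, imputed)

# Toy indicator of whether each cell belongs to one cell type.
is_type_true = [1, 1, 0, 0, 1]
is_type_pred = [1, 0, 0, 1, 1]
f1 = f1_score_binary(is_type_true, is_type_pred)  # precision = sensitivity = 2/3
```

In the toy example there are two true positives, one false positive and one false negative, so precision and sensitivity are both 2/3 and the F1 score is also 2/3.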

All evaluations show that the imputation generates results comparable with the training data, which supports the validity of this method.

3.2.2 Application 2.2: CyCIF panel reduction

This method is intended as an improvement on the authors' own previous work (Ternes et al. 2022). The previous work first performs panel selection and then imputes marker channels with a variational autoencoder (VAE). The improved method (Sims and Chang 2023) uses a masked autoencoder for image synthesis, as shown in Figure 3.4. A further difference is the adoption of within-model iterative selection of the marker panel, as the authors believe that panel selection should be more closely tied to panel reconstruction. Starting with the standard DAPI stain, each candidate marker is added to the panel in turn, the intensities of the remaining markers are predicted, and the mean Spearman correlation between the predicted and real intensities is calculated. The marker with the highest correlation is selected, and the next round continues until the panel is complete. The ratio of masked channels depends on the task, though 25% to 75% is a reasonable range.
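The iterative selection loop can be sketched as a greedy search. In the sketch below a least-squares linear map stands in for the masked autoencoder, purely so the example is self-contained; it is not the authors' model. The function names, the toy data, the use of column 0 as the DAPI stand-in and the scoring on the same data used for fitting are all assumptions.

```python
import numpy as np

def spearman(a, b):
    """Spearman correlation as Pearson correlation of ranks (ties not handled)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

def score_panel(data, panel):
    """Mean Spearman correlation between predicted and real held-out markers.

    Stand-in predictor: a linear least-squares map from panel markers
    to the remaining markers (the paper uses a masked autoencoder).
    """
    rest = [j for j in range(data.shape[1]) if j not in panel]
    design = np.column_stack([data[:, panel], np.ones(len(data))])
    coef, *_ = np.linalg.lstsq(design, data[:, rest], rcond=None)
    pred = design @ coef
    return np.mean([spearman(pred[:, i], data[:, j]) for i, j in enumerate(rest)])

def greedy_panel(data, panel_size, start=0):
    """Grow the panel one marker at a time, keeping the best-scoring candidate."""
    panel = [start]  # column `start` plays the role of DAPI
    while len(panel) < panel_size:
        candidates = [j for j in range(data.shape[1]) if j not in panel]
        scores = [score_panel(data, panel + [j]) for j in candidates]
        panel.append(candidates[int(np.argmax(scores))])
    return panel

# Toy data: 200 cells, 8 markers driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.standard_normal((200, 3))
data = latent @ rng.standard_normal((3, 8)) + 0.1 * rng.standard_normal((200, 8))
panel = greedy_panel(data, panel_size=4)
```

Each round refits the predictor for every candidate, so the cost grows with both the number of markers and the panel size; `panel_size` must stay below the total number of markers so that some markers remain to be predicted.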

The method is evaluated by the Spearman correlation between the imputed and the true data. The results show that both the MAE and the iterative panel selection outperform the VAE and the out-of-the-box panel selection of the previous method.

Figure 3.4: CyCIF panel reduction with autoencoder. Figure courtesy of Sims and Chang (2023).